Sentiment Analysis¶
In this notebook, we’re going to learn how to use VADER (Valence Aware Dictionary and sEntiment Reasoner), a sentiment analysis tool designed for social media.
We’re going to see how well VADER works with our own sentences and with sentences from The House on Mango Street. Can we create an accurate plot arc of Sandra Cisneros’s coming-of-age novel?
Install and Import Libraries/Packages¶
Import Pandas and set Pandas display column width to 400 characters
import pandas as pd
pd.options.display.max_colwidth = 400
Install vaderSentiment package with pip
!pip install vaderSentiment
Collecting vaderSentiment
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Requirement already satisfied: requests in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from vaderSentiment) (2.23.0)
Requirement already satisfied: idna<3,>=2.5 in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from requests->vaderSentiment) (2.9)
Requirement already satisfied: certifi>=2017.4.17 in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from requests->vaderSentiment) (2020.12.5)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from requests->vaderSentiment) (1.25.9)
Requirement already satisfied: chardet<4,>=3.0.2 in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from requests->vaderSentiment) (3.0.4)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2
Import SentimentIntensityAnalyzer and initialize it
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentimentAnalyser = SentimentIntensityAnalyzer()
Calculate Sentiment Scores¶
To calculate sentiment scores for a sentence or paragraph, we can use the .polarity_scores() method.
sentimentAnalyser.polarity_scores("I like the Marvel movies")
{'neg': 0.0, 'neu': 0.361, 'pos': 0.639, 'compound': 0.6486}
sentimentAnalyser.polarity_scores("I don't like the Marvel movies")
{'neg': 0.526, 'neu': 0.474, 'pos': 0.0, 'compound': -0.5334}
sentimentAnalyser.polarity_scores("I don't *not* like the Marvel movies")
{'neg': 0.255, 'neu': 0.546, 'pos': 0.199, 'compound': -0.1307}
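Each call returns a dictionary with four values. According to the VADER documentation, neg, neu, and pos are the proportions of the text that fall into each category, and compound is a single normalized score between -1 (most negative) and +1 (most positive). The documentation also suggests cutoffs of ±0.05 on the compound score for calling a text positive, negative, or neutral. The helper below is a minimal sketch of that convention. The name label_sentiment is our own and not part of vaderSentiment, and the score dictionaries are copied from the outputs above so the example runs without VADER installed.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a VADER compound score into a coarse label.

    label_sentiment is a hypothetical helper; the default threshold
    follows the convention suggested in the VADER documentation.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dictionaries copied from the three Marvel examples above
example_scores = {
    "I like the Marvel movies":
        {'neg': 0.0, 'neu': 0.361, 'pos': 0.639, 'compound': 0.6486},
    "I don't like the Marvel movies":
        {'neg': 0.526, 'neu': 0.474, 'pos': 0.0, 'compound': -0.5334},
    "I don't *not* like the Marvel movies":
        {'neg': 0.255, 'neu': 0.546, 'pos': 0.199, 'compound': -0.1307},
}

# Label each sentence by its compound score
labels = {sentence: label_sentiment(scores['compound'])
          for sentence, scores in example_scores.items()}

for sentence, label in labels.items():
    print(label, '|', sentence)
```

Note that the double negative ends up labeled negative, which is worth keeping in mind when you experiment with your own sentences.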
Your Turn!¶
Try out sentimentAnalyser on some sentences of your own!
Experiment with capitalization, punctuation, emojis, historical words, slangy language, poetry, or non-English words. How does VADER handle it? What does VADER seem to do well and not so well?
#Your code here
#Your code here
Calculate Sentiment Scores for The House on Mango Street¶
To calculate sentiment scores for The House on Mango Street, we first need a quick-and-easy way to break the novel up into sentences.
Install and Import NLTK¶
Install NLTK, a Python library for text analysis and natural language processing
!pip install nltk
Requirement already satisfied: nltk in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (3.5)
Requirement already satisfied: click in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from nltk) (7.1.2)
Requirement already satisfied: regex in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from nltk) (2018.1.10)
Requirement already satisfied: joblib in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from nltk) (0.14.1)
Requirement already satisfied: tqdm in /Users/melaniewalsh/opt/anaconda3/lib/python3.7/site-packages (from nltk) (4.46.0)
Import nltk and download the model that will help us get sentences
import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data] /Users/melaniewalsh/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
Load Text and Break Into Sentences¶
Read in a text file to practice on. Here we use Pride and Prejudice
text_file = "../texts/literature/Pride-and-Prejudice_Jane-Austen.txt"
chapter = open(text_file, encoding="utf-8").read()
To break a string into individual sentences, we can use nltk.sent_tokenize()
nltk.sent_tokenize(chapter[:1000])
['\nPRIDE & PREJUDICE.',
'CHAPTER I.',
'It is a truth universally acknowledged, that a single man in possession\nof a good fortune, must be in want of a wife.',
'However little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds\nof the surrounding families, that he is considered as the rightful\nproperty of some one or other of their daughters.',
'"My dear Mr. Bennet," said his lady to him one day, "have you heard that\nNetherfield Park is let at last?"',
'Mr. Bennet replied that he had not.',
'"But it is," returned she; "for Mrs. Long has just been here, and she\ntold me all about it."',
'Mr. Bennet made no answer.',
'"Do not you want to know who has taken it?"',
'cried his wife impatiently.',
'"_You_ want to tell me, and I have no objection to hearing it."',
'This was invitation enough.',
'"Why, my dear, you must know, Mrs. Long says that Netherfield is taken\nby a young man of large fortune from the north of England; that he came\ndown']
sentences = nltk.sent_tokenize(chapter[:1000])
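The sentences returned by nltk.sent_tokenize() still contain the line breaks (\n) of the original text file. These should not matter much for scoring, but they clutter printed output and DataFrames. One optional cleanup step, sketched below, collapses every run of whitespace into a single space. The helper name clean_whitespace is hypothetical, and the sample sentence is copied from the output above.

```python
def clean_whitespace(sentence):
    """Collapse line breaks and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(sentence.split())


# Sentence copied from the sent_tokenize() output above
raw_sentence = ('It is a truth universally acknowledged, that a single man '
                'in possession\nof a good fortune, must be in want of a wife.')

cleaned_sentence = clean_whitespace(raw_sentence)
print(cleaned_sentence)
```

In the notebook, this could be applied to the whole list with [clean_whitespace(s) for s in sentences].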
Calculate Scores for Each Sentence¶
We can loop through the sentences and calculate sentiment scores for every sentence.
How would we print just the “compound” score for each sentence?
for sentence in sentences:
    scores = sentimentAnalyser.polarity_scores(sentence)
    print(sentence, '\n', scores, '\n')
PRIDE & PREJUDICE.
{'neg': 0.494, 'neu': 0.122, 'pos': 0.384, 'compound': -0.2263}
CHAPTER I.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.
{'neg': 0.0, 'neu': 0.755, 'pos': 0.245, 'compound': 0.6705}
However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered as the rightful
property of some one or other of their daughters.
{'neg': 0.0, 'neu': 0.902, 'pos': 0.098, 'compound': 0.6147}
"My dear Mr. Bennet," said his lady to him one day, "have you heard that
Netherfield Park is let at last?"
{'neg': 0.0, 'neu': 0.885, 'pos': 0.115, 'compound': 0.3818}
Mr. Bennet replied that he had not.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
"But it is," returned she; "for Mrs. Long has just been here, and she
told me all about it."
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Mr. Bennet made no answer.
{'neg': 0.355, 'neu': 0.645, 'pos': 0.0, 'compound': -0.296}
"Do not you want to know who has taken it?"
{'neg': 0.12, 'neu': 0.88, 'pos': 0.0, 'compound': -0.0572}
cried his wife impatiently.
{'neg': 0.726, 'neu': 0.274, 'pos': 0.0, 'compound': -0.6486}
"_You_ want to tell me, and I have no objection to hearing it."
{'neg': 0.152, 'neu': 0.759, 'pos': 0.09, 'compound': -0.2263}
This was invitation enough.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
"Why, my dear, you must know, Mrs. Long says that Netherfield is taken
by a young man of large fortune from the north of England; that he came
down
{'neg': 0.0, 'neu': 0.915, 'pos': 0.085, 'compound': 0.3818}
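To print just the compound score, remember that .polarity_scores() returns a dictionary, so the value can be pulled out with the 'compound' key. The sketch below uses three sentence and score pairs copied from the output above in place of a live sentimentAnalyser call, so that it runs on its own. In the notebook, the same idea is print(sentence, scores['compound']) inside the existing loop.

```python
# (sentence, scores) pairs copied from the output above; in the notebook
# the scores would come from sentimentAnalyser.polarity_scores(sentence)
scored_sentences = [
    ("CHAPTER I.",
     {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}),
    ("Mr. Bennet made no answer.",
     {'neg': 0.355, 'neu': 0.645, 'pos': 0.0, 'compound': -0.296}),
    ("cried his wife impatiently.",
     {'neg': 0.726, 'neu': 0.274, 'pos': 0.0, 'compound': -0.6486}),
]

compound_scores = []
for sentence, scores in scored_sentences:
    # Look up only the compound value in the scores dictionary
    compound = scores['compound']
    compound_scores.append(compound)
    print(sentence, compound)
```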
Make DataFrame¶
A convenient way to create a DataFrame is to make a list of dictionaries.
Below we loop through the sentences, calculate sentiment scores, and then create a mini-dictionary with the sentence and the compound score, which we append to the list sentence_scores.
sentence_scores = []
for sentence in sentences:
    scores = sentimentAnalyser.polarity_scores(sentence)
    sentence_scores.append({'sentence': sentence, 'score': scores['compound']})
To make this list of dictionaries into a DataFrame, we can simply use pd.DataFrame()
pd.DataFrame(sentence_scores)
| | sentence | score |
|---|---|---|
| 0 | \nPRIDE & PREJUDICE. | -0.2263 |
| 1 | CHAPTER I. | 0.0000 |
| 2 | It is a truth universally acknowledged, that a single man in possession\nof a good fortune, must be in want of a wife. | 0.6705 |
| 3 | However little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds\nof the surrounding families, that he is considered as the rightful\nproperty of some one or other of their daughters. | 0.6147 |
| 4 | "My dear Mr. Bennet," said his lady to him one day, "have you heard that\nNetherfield Park is let at last?" | 0.3818 |
| 5 | Mr. Bennet replied that he had not. | 0.0000 |
| 6 | "But it is," returned she; "for Mrs. Long has just been here, and she\ntold me all about it." | 0.0000 |
| 7 | Mr. Bennet made no answer. | -0.2960 |
| 8 | "Do not you want to know who has taken it?" | -0.0572 |
| 9 | cried his wife impatiently. | -0.6486 |
| 10 | "_You_ want to tell me, and I have no objection to hearing it." | -0.2263 |
| 11 | This was invitation enough. | 0.0000 |
| 12 | "Why, my dear, you must know, Mrs. Long says that Netherfield is taken\nby a young man of large fortune from the north of England; that he came\ndown | 0.3818 |
Let’s examine the sentences from negative to positive sentiment scores.
pp_df = pd.DataFrame(sentence_scores)
pp_df.sort_values(by='score')
| | sentence | score |
|---|---|---|
| 9 | cried his wife impatiently. | -0.6486 |
| 7 | Mr. Bennet made no answer. | -0.2960 |
| 0 | \nPRIDE & PREJUDICE. | -0.2263 |
| 10 | "_You_ want to tell me, and I have no objection to hearing it." | -0.2263 |
| 8 | "Do not you want to know who has taken it?" | -0.0572 |
| 1 | CHAPTER I. | 0.0000 |
| 5 | Mr. Bennet replied that he had not. | 0.0000 |
| 6 | "But it is," returned she; "for Mrs. Long has just been here, and she\ntold me all about it." | 0.0000 |
| 11 | This was invitation enough. | 0.0000 |
| 4 | "My dear Mr. Bennet," said his lady to him one day, "have you heard that\nNetherfield Park is let at last?" | 0.3818 |
| 12 | "Why, my dear, you must know, Mrs. Long says that Netherfield is taken\nby a young man of large fortune from the north of England; that he came\ndown | 0.3818 |
| 3 | However little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds\nof the surrounding families, that he is considered as the rightful\nproperty of some one or other of their daughters. | 0.6147 |
| 2 | It is a truth universally acknowledged, that a single man in possession\nof a good fortune, must be in want of a wife. | 0.6705 |
Calculate Sentiment Scores By Chapter¶
To calculate sentiment scores for the sentences in each chapter of The House on Mango Street, we need to read in each file individually.
Here we will import glob and Path, which allow us to get all the filenames for the chapters and extract the titles.
import glob
from pathlib import Path
Create a list of filenames for every .txt file in the directory
directory_path = "../texts/literature/House-on-Mango-Street/"
text_files = glob.glob(f"{directory_path}/*.txt")
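glob.glob() returns full paths and makes no promise about their order, while Path(...).stem drops both the directory and the .txt extension, which is how chapter titles can be recovered from filenames. Here is a small sketch of both ideas. The filenames are assumptions, reconstructed from chapter titles that appear in the results further down, and sorted() is an optional extra step that the notebook itself does not use.

```python
from pathlib import Path

# Assumed filename, reconstructed from a chapter title in the results
example_path = "../texts/literature/House-on-Mango-Street/23-Born-Bad.txt"

# .stem keeps only the filename, without directory or extension
title = Path(example_path).stem
print(title)

# Hypothetical unordered listing, as glob.glob() might return it
unsorted_files = [
    "17-The-Family-of-Little-Feet.txt",
    "06-Our-Good-Day.txt",
    "23-Born-Bad.txt",
]

# Sorting by name puts the zero-padded chapter numbers in reading order
ordered_titles = [Path(name).stem for name in sorted(unsorted_files)]
print(ordered_titles)
```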
Loop through each file in the "House on Mango Street" directory, split it into sentences, and calculate a sentiment score for every sentence
sentence_scores = []
# Loop through all the filenames
for text_file in text_files:
    # Read in the file
    chapter = open(text_file, encoding="utf-8").read()
    # Extract the chapter title from the filename (no directory or extension)
    title = Path(text_file).stem
    # Loop through each sentence in the chapter
    for sentence in nltk.sent_tokenize(chapter):
        # Calculate sentiment scores for the sentence
        scores = sentimentAnalyser.polarity_scores(sentence)
        # Make mini-dictionary with chapter name, sentence, and sentiment score
        sentence_scores.append({'chapter': title,
                                'sentence': sentence,
                                'score': scores['compound']})
Let’s create a DataFrame from all these sentences
chapter_df = pd.DataFrame(sentence_scores)
# Make the DataFrame alphabetical by chapter
chapter_df = chapter_df.sort_values(by='chapter')
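With one row per sentence and a chapter column, one possible way to approach the plot-arc question from the introduction is to average the compound scores within each chapter. The sketch below shows the idea on a toy DataFrame. The four rows reuse chapter titles and scores that appear among this notebook's most negative results, so the resulting means are illustrative only and say nothing about the chapters as a whole.

```python
import pandas as pd

# Illustrative rows only, not the full novel
toy_df = pd.DataFrame([
    {'chapter': '17-The-Family-of-Little-Feet', 'score': -0.8519},
    {'chapter': '17-The-Family-of-Little-Feet', 'score': -0.7050},
    {'chapter': '23-Born-Bad', 'score': -0.8442},
    {'chapter': '23-Born-Bad', 'score': -0.7351},
])

# Average compound score per chapter, ordered by chapter name
chapter_means = toy_df.groupby('chapter')['score'].mean().sort_index()
print(chapter_means)
```

On the real data, the equivalent call would be chapter_df.groupby('chapter')['score'].mean(), which gives one value per chapter that could then be plotted in chapter order.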
How would we examine the most negative 15 sentences?
chapter_df.sort_values(by='score')[:15]
| | chapter | sentence | score | rolling_mean |
|---|---|---|---|---|
| 1028 | 17-The-Family-of-Little-Feet | Bum man is yelling something to the air but by now we are running fast and far away, our high heel shoes taking us all the way down the avenue and around the block, past the ugly cousins, past Mr. Benny’s, up Mango Street, the back way, just in case. | -0.8519 | 0.085570 |
| 0 | 23-Born-Bad | Born Bad\n\n\nMost likely I will go to hell and most likely I deserve to be there. | -0.8442 | -0.032880 |
| 1305 | 22-Papa-Who-Wakes-Up-Tired-in-the-Dark | Papa\n\nWho Wakes Up\n\nTired\n\nin the Dark\n\n\nYour abuelito is dead, Papa says early one morning in my room. | -0.8020 | 0.115703 |
| 173 | 12-Those-Who-Don’t | They are stupid people who are lost and got here by mistake. | -0.7964 | 0.052530 |
| 426 | 13-There-Was-an-Old-Woman-She-Had-So-Many-Children-She-Didn’t-Know-What-to-Do | But after a while you get tired of being worried about kids who aren’t even yours. | -0.7684 | -0.010923 |
| 315 | 16-And-Some-More | Anita, Stella, Dennis, and Lolo …\n\nWho you calling ugly, ugly? | -0.7650 | 0.136313 |
| 565 | 18-A-Rice-Sandwich | she said, pointing to a row of ugly three-flats, the ones even the raggedy men are ashamed to go into. | -0.7506 | 0.050130 |
| 422 | 13-There-Was-an-Old-Woman-She-Had-So-Many-Children-She-Didn’t-Know-What-to-Do | They are bad those Vargases, and how can they help it with only one mother who is tired all the time from buttoning and bottling and babying, and who cries every day for the man who left without even leaving a dollar for bologna or a note explaining how come. | -0.7506 | -0.020310 |
| 47 | 23-Born-Bad | I hated to go there alone. | -0.7351 | -0.102603 |
| 957 | 06-Our-Good-Day | Past my house, sad and red and crumbly in places, past Mr. Benny’s grocery on the corner, and down the avenue which is dangerous. | -0.7351 | 0.130330 |
| 1189 | 14-Alicia-Who-Sees-Mice | Alicia, whose mama died, is sorry there is no one older to rise and make the lunchbox tortillas. | -0.7269 | -0.081270 |
| 1410 | 10-Louie,-His-Cousin-&-His-Other-Cousin | Marin screamed and we ran down the block to where the cop car’s siren spun a dizzy blue. | -0.7269 | 0.060390 |
| 672 | 11-Marin | But next year Louie’s parents are going to send her back to her mother with a letter saying she’s too much trouble, and that is too bad because I like Marin. | -0.7227 | 0.112887 |
| 1261 | 38-The-Monkey-Garden | When I got back Sally was pretending to be mad … something about the boys having stolen her keys. | -0.7184 | -0.033563 |
| 1018 | 17-The-Family-of-Little-Feet | Now you know to talk to drunks is crazy and to tell them your name is worse, but who can blame her. | -0.7050 | 0.100773 |
How would we examine the most positive 15 sentences?
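One possible answer is to sort in descending order before slicing. The sketch below uses a toy DataFrame with placeholder sentences and made-up scores so that it runs on its own. In the notebook, chapter_df would take the place of toy_df and n would be 15.

```python
import pandas as pd

# Toy data with placeholder sentences; use chapter_df in the notebook
toy_df = pd.DataFrame({
    'sentence': ['sentence A', 'sentence B', 'sentence C', 'sentence D'],
    'score': [0.0, 0.6705, -0.6486, 0.3818],
})

n = 2  # use 15 for chapter_df

# Sort from highest to lowest score, then keep the first n rows
most_positive = toy_df.sort_values(by='score', ascending=False)[:n]
print(most_positive)

# nlargest() is a shorthand that selects the same top rows
also_most_positive = toy_df.nlargest(n, 'score')
```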
Full Text¶
VADER was designed for social media, so here we also try it on tweets: we score every tweet in a CSV file of Donald Trump's tweets and track a rolling average of the scores.
trump_tweets = pd.read_csv('../texts/politics/Trump-Tweets.csv')
# Calculate the compound sentiment score for every tweet
sentiment_scores = []
for tweet in trump_tweets['text']:
    scores = sentimentAnalyser.polarity_scores(tweet)
    sentiment_scores.append(scores['compound'])
trump_tweets['sentiment_score'] = sentiment_scores
# Parse the timestamps and derive year and month columns
trump_tweets['date'] = pd.to_datetime(trump_tweets['created_at'])
trump_tweets['year'] = pd.to_datetime(trump_tweets['date'].dt.year, format='%Y')
# 'Date (by month)' has to exist before it can be used as the index
trump_tweets['Date (by month)'] = trump_tweets['date'].dt.to_period('M').dt.to_timestamp()
trump_tweets = trump_tweets.set_index('Date (by month)')
# Average the scores over a sliding window of 50 tweets
trump_tweets['rolling_mean'] = trump_tweets['sentiment_score'].rolling(50).mean()
trump_tweets['rolling_mean'].plot(style='.')
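A call like .rolling(50).mean() replaces each score with the average of that row and the 49 rows before it, which smooths out tweet-to-tweet noise. The first 49 rows have no full window and therefore come out as NaN. The toy sketch below shows the same behavior with a window of 3 on made-up scores. The min_periods variant is an optional alternative, not something the notebook uses.

```python
import pandas as pd

# Made-up scores standing in for the sentiment_score column
toy_scores = pd.Series([0.5, -0.5, 0.0, 1.0, 0.5])

# Each value is the mean of the current row and the two rows before it;
# the first two rows have no full window and become NaN
rolling = toy_scores.rolling(3).mean()
print(rolling)

# min_periods=1 averages whatever rows are available instead of giving NaN
rolling_filled = toy_scores.rolling(3, min_periods=1).mean()
print(rolling_filled)
```

The window runs over rows in whatever order the DataFrame has them, so the row order of the CSV determines which tweets are averaged together.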
import altair as alt
trump_tweets
| | source | text | created_at | retweet_count | favorite_count | is_retweet | id_str | date | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Twitter for iPhone | Just finished a very good conversation with President Xi of China. Discussed in great detail the CoronaVirus that is ravaging large parts of our Planet. China has been through much & has developed a strong understanding of the Virus. We are working closely together. Much respect! | 03-27-2020 05:19:02 | 33074 | 202087 | False | 1243407157321560000 | 2020-03-27 05:19:02 | 2020-01-01 |
| 1 | Twitter for iPhone | Will be interviewed on @seanhannity at 9:10 P.M. @FoxNews | 03-27-2020 01:05:59 | 7419 | 42186 | False | 1243343475799720000 | 2020-03-27 01:05:59 | 2020-01-01 |
| 2 | Twitter for iPhone | The world is at war with a hidden enemy. WE WILL WIN! https://t.co/QLceNWcL6Z | 03-26-2020 23:50:02 | 24472 | 97346 | False | 1243324360523490000 | 2020-03-26 23:50:02 | 2020-01-01 |
| 3 | Twitter for iPhone | Our great Oil & Gas industry is under under seige after having one of the best years in recorded history. It will get better than ever as soon as our Country starts up again. Vital that it does for our National Security! | 03-26-2020 23:06:28 | 25514 | 131210 | False | 1243313399284500000 | 2020-03-26 23:06:28 | 2020-01-01 |
| 4 | Twitter for iPhone | Will be going out in 10 minutes for the press conference. | 03-26-2020 20:57:15 | 15797 | 130201 | False | 1243280878991790000 | 2020-03-26 20:57:15 | 2020-01-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29390 | Twitter Web Client | My persona will never be that of a wallflower - I’d rather build walls than cling to them --Donald J. Trump | 05-12-2009 14:07:28 | 1421 | 1950 | False | 1773561338 | 2009-05-12 14:07:28 | 2009-01-01 |
| 29391 | Twitter Web Client | New Blog Post: Celebrity Apprentice Finale and Lessons Learned Along the Way: http://tinyurl.com/qlux5e | 05-08-2009 20:40:15 | 8 | 27 | False | 1741160716 | 2009-05-08 20:40:15 | 2009-01-01 |
| 29392 | Twitter Web Client | Donald Trump reads Top Ten Financial Tips on Late Show with David Letterman: http://tinyurl.com/ooafwn - Very funny! | 05-08-2009 13:38:08 | 3 | 2 | False | 1737479987 | 2009-05-08 13:38:08 | 2009-01-01 |
| 29393 | Twitter Web Client | Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Think Like A Champion! | 05-05-2009 01:00:10 | 2 | 3 | False | 1701461182 | 2009-05-05 01:00:10 | 2009-01-01 |
| 29394 | Twitter Web Client | Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight! | 05-04-2009 18:54:25 | 253 | 202 | False | 1698308935 | 2009-05-04 18:54:25 | 2009-01-01 |
29395 rows × 9 columns
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')
trump_tweets.plot(y='sentiment_score', x='retweet_count', kind='scatter')
alt.Chart(trump_tweets).mark_circle(size=10).encode(
    x='date',
    y='rolling_mean',
    color=alt.Color('rolling_mean',
                    scale=alt.Scale(domain=[-.2, 0, .5],
                                    range=['red', 'lightblue', 'darkblue'],
                                    type='linear')),
    tooltip=['text', 'sentiment_score']).interactive()